Abstract

It is becoming increasingly important for machine learning
methods to make predictions that are interpretable as well as
accurate. In many practical applications, it is of interest
which features and feature interactions are relevant to the
prediction task. We present a novel method, Selective Bayesian
Forest Classifier, that strikes a balance between predictive
power and interpretability by simultaneously performing
classification, feature selection, feature interaction
detection and visualization. It builds parsimonious yet
flexible models using tree-structured Bayesian networks, and
samples an ensemble of such models using Markov chain Monte
Carlo. We build in feature selection by dividing the trees
into two groups according to their relevance to the outcome of
interest. Our method performs competitively on classification
and feature selection benchmarks in low and high dimensions,
and includes a visualization tool that provides insight into
relevant features and interactions.

1. Introduction 

Feature selection and classification are key objectives in
machine learning that are usually tackled separately. However,
performing classification on its own tends to produce black
box solutions that are difficult to interpret, while
performing feature selection alone can be difficult to justify
without being validated by prediction. In addition to
screening for relevant features, it is also useful to detect
interactions between them. In many decision support systems,
e.g. in medical diagnostics, the users care about which
features and interactions contributed to a particular decision
made by the system. Selective Bayesian Forest Classifier
(SBFC) combines predictive power and interpretability by
performing classification, feature selection, and feature
interaction detection at the same time. Our method also
provides a visual representation of the relevance of different
features and feature interactions to the outcome of interest.

Figure 1: Example of an SBFC graph

The main idea of SBFC is to construct an ensemble of Bayesian
networks [Pearl, 1988], each constrained to a forest of trees
divided into signal and noise groups based on their
relationship with the class label Y (see Figure 1 for an
example). The nodes and edges in Group 1 represent relevant
features and interactions. Such models are easy to sample
using Markov chain Monte Carlo (MCMC). We combine their
predictions using Bayesian model averaging, and aggregate
their feature and interaction selection.

We show that SBFC performs competitively with state-of-the-art
methods on 25 low-dimensional and 6 high-dimensional benchmark
data sets. By adding noise features to a synthetic data set,
we compare feature selection and interaction detection
performance as the signal to noise ratio decreases (Figure 5).
We use a high-dimensional data set from the NIPS 2003 feature
selection challenge to demonstrate SBFC’s superior performance
on a difficult feature selection task (Figure 6), and
illustrate the visualization tool on a heart disease data set
with meaningful features (Figure 4). SBFC is a good choice of
algorithm for applications where interpretability matters
along with predictive power (an R package is available at
github.org/vkrakovna/sbfc).
2. Related Work

Tree structures are frequently used in computer science and
statistics, because they provide adequate flexibility to model
complex structures, yet are constrained enough to facilitate
computation. SBFC was inspired by tree-based methods such as
Tree-Augmented Naive Bayes (TAN) [Friedman et al., 1997] and
Averaged One-Dependence Estimators (AODE) [Webb et al., 2005].
TAN finds the optimal tree on all the features using the
minimum spanning tree algorithm, with the class label Y as a
second parent for all the features. While the search for the
best unrestricted Bayesian network is usually an intractable
task [Heckerman et al., 1995], the computational complexity of
TAN is only $O(d^2 n)$, where d is the number of features and
n is the sample size [Chow & Liu, 1968]. AODE constrains the
model structure to a tree where all the features are children
of the root feature, with Y as a second parent, and uses model
averaging over models with all possible root features. These
methods put all the features into a single tree, which can be
difficult to interpret, especially for high-dimensional data
sets. We extend TAN and AODE by building forests instead of
single-tree graphs, and by introducing selection of relevant
features and interactions.

Feature selection is often used as a preprocessing step for
classification algorithms. Wrapper methods [Kohavi & John,
1997] select a subset of features tailored for a specific
classifier, treating it as a black box. Variable Selection for
Clustering and Classification (VSCC) [Andrews & McNicholas,
2014] searches for a feature subset that simultaneously
minimizes the within-class variance and maximizes the
between-class variance, and remains efficient in high
dimensions. Categorical Adaptive Tube Covariate Hunting
(CATCH) [Tang et al., 2014] selects features based on a
nonparametric measure of the relational strength between the
feature and the class label.

Our approach, however, is to integrate feature selection into
the classification algorithm itself, allowing it to influence
the models built for classification. A classical example is
Lasso [Tibshirani, 1996], which performs feature selection
using $L_1$ regularization. Some decision tree classifiers,
like Random Forest [Breiman, 2001] and BART [Chipman et al.,
2010], provide importance measures for features and the option
to drop the least significant features. In many applications,
it is also key to identify relevant feature interactions, such
as epistatic effects in genetics. Interaction detection
methods for gene association models include Graphical Gaussian
models [Andrei & Kendziorski, 2009] and Bayesian Epistasis
Association Mapping (BEAM) [Zhang & Liu, 2007]. BEAM
introduces a latent indicator that partitions the features
into several groups based on their relationship with the class
label. One of the groups in BEAM is designed to capture
relevant feature interactions, but is only able to tractably
model a small number of them. SBFC extends this framework,
using tree structures to represent an unlimited number of
relevant feature interactions.

3. Selective Bayesian Forest Classifier (SBFC) 

3.1. Model 

Given n observations with class label Y and d discrete
features $X_j$, $j = 1, \ldots, d$, we divide the features
into two groups based on their relation to Y:

Group 0 (noise): features that are unrelated to Y 
Group 1 (signal): features that are related to Y 

We further partition each group into non-overlapping subgroups
mutually independent of each other conditional on Y. For each
subgroup, we infer a tree structure describing the dependence
relationships between the features (many subgroups will
consist of one node and thus have a trivial dependence
structure). Note that we model the structure in the noise
group as well as the signal group, since an independence
assumption for the noise features could result in correlated
noise features being misclassified as signal features.

The overall dependence structure is thus modeled as a forest
of trees, representing conditional dependencies between the
features (no causal relationships are inferred). The class
label Y is a parent of every feature in Group 1 (edges to Y
are omitted in subsequent figures). We will refer to the
combination of a group partition and a forest structure as a
graph.

The prior consists of a penalty on the number of edges between
features in each group and a penalty on the number of signal
nodes (i.e., edges between features and Y):

$$P(G) \propto d^{-4\left(E_0(G) + E_1(G)/v\right) - D_1(G)/v}$$

where $D_i(G)$ is the number of nodes and $E_i(G)$ is the
number of edges in Group i of graph G, while v is a constant
equal to the number of classes.

The prior scales with d, the number of features, to penalize
very large, hard-to-interpret trees in high-dimensional cases.
The terms corresponding to the signal group are divided by v,
the number of possible classes, to avoid penalizing large
trees in the signal group more than in the noise group by
default. The coefficients in the prior were found in practice
to provide good classification and feature selection
performance (there is a relatively wide range of coefficients
that produce similar results).
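As a rough illustration of how this prior trades off edges and signal nodes, the sketch below evaluates the unnormalized log prior. It assumes the prior takes the form $d^{-c(E_0 + E_1/v) - D_1/v}$ with an edge coefficient $c = 4$; both the exact form and the value of `edge_coef` are assumptions read from the prior formula, and the function name is hypothetical.

```python
import math


def log_prior(d, v, e0, e1, d1, edge_coef=4.0):
    """Unnormalized log prior of a graph (illustrative sketch).

    d: number of features, v: number of classes,
    e0, e1: number of edges in Group 0 and Group 1,
    d1: number of nodes in Group 1.
    edge_coef is an assumed value for the edge penalty coefficient.
    """
    exponent = edge_coef * (e0 + e1 / v) + d1 / v
    return -exponent * math.log(d)


# With 100 features and 2 classes, an extra noise edge costs twice
# as much (in log prior) as an extra signal edge.
noise_edge_cost = log_prior(100, 2, 1, 0, 0)
signal_edge_cost = log_prior(100, 2, 0, 1, 0)
```

Because the penalty is multiplied by log d, the same edge becomes more expensive as the number of features grows, which matches the stated goal of discouraging large trees in high dimensions.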

Table 1: Parent sets for each feature type

Type of feature $X_j$     Parent set $A_j$
Group 0 root              $\emptyset$
Group 0 non-root          $\{X_{p_j}\}$
Group 1 root              $\{Y\}$
Group 1 non-root          $\{X_{p_j}, Y\}$

(a) Switch Trees: switch tree {Xq, X7} to Group 0, switch tree
{Xg} to Group 1

Given the training data $\mathbf{X}_{(n \times d)}$ (with
columns $X_j$, $j = 1, \ldots, d$) and
$\mathbf{y}_{(n \times 1)}$, we break down the graph likelihood
according to the tree structure:

$$P(\mathbf{X} \mid G, \mathbf{y}) = \prod_{j=1}^{d} P(X_j \mid A_j)$$

Here, $A_j$ is the set of parents of $X_j$ in graph G. This
set includes the parent $X_{p_j}$ of $X_j$ unless $X_j$ is a
root, and Y if $X_j$ is in Group 1, as shown in Table 1. We
assume that the distributions of the class label Y and the
graph structure G are independent a priori.
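The parent-set rule described above (tree parent unless the node is a root, plus Y for Group 1 features) can be sketched as follows. The representation of a graph as a `parent` list and a `group` list is an assumption made for illustration, not necessarily the data structure used in the SBFC package.

```python
from typing import List, Optional, Tuple


def parent_set(j: int, parent: List[Optional[int]], group: List[int]) -> Tuple[str, ...]:
    """Return the parent set of feature j as a tuple of node names.

    parent[j] is the index of the tree parent of feature j (None for a root);
    group[j] is 0 for noise features and 1 for signal features.
    """
    parents = []
    if parent[j] is not None:      # non-root: include the tree parent
        parents.append(f"X{parent[j]}")
    if group[j] == 1:              # signal feature: Y is also a parent
        parents.append("Y")
    return tuple(parents)


# Hypothetical forest with two trees: X0 -> X1 in Group 0, X2 -> X3 in Group 1.
parent = [None, 0, None, 2]
group = [0, 0, 1, 1]
all_parent_sets = [parent_set(j, parent, group) for j in range(4)]
```

The four entries of `all_parent_sets` cover the four feature types: Group 0 root, Group 0 non-root, Group 1 root, and Group 1 non-root.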

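Let $v_j$ and $w_j$ be the number of possible values for $X_j$
and $A_j$ respectively. Then our hierarchical model for $X_j$
is

$$[X_j \mid A_j = A_{jl}, \Theta_{jl} = \theta_{jl}] \sim \mathrm{Mult}(\theta_{jl}), \quad l = 1, \ldots, w_j$$

$$\Theta_{jl} \sim \mathrm{Dirichlet}\left(\frac{\alpha}{w_j v_j} \mathbf{1}_{v_j}\right)$$

Each conditional Multinomial model has a different parameter
vector $\Theta_{jl}$. We consider the Dirichlet
hyperparameters to represent “pseudo-counts” in each
conditional model [Friedman et al., 1997]. Let $n_{jkl}$ be
the number of observations in the training data with
$X_j = x_{jk}$ and $A_j = A_{jl}$, and
$n_{jl} = \sum_{k=1}^{v_j} n_{jkl}$. Then

$$P(X_j \mid A_j, \Theta_j) = \prod_{l=1}^{w_j} \prod_{k=1}^{v_j} \theta_{jkl}^{n_{jkl}}$$

We then integrate out the nuisance parameters $\Theta_{jl}$,
$l = 1, \ldots, w_j$. The resulting likelihood depends only on
the hyperparameter $\alpha$ and the counts of observations for
each combination of values of $X_j$ and $A_j$:

$$P(X_j \mid A_j) = \prod_{l=1}^{w_j} \frac{\Gamma\left(\frac{\alpha}{w_j}\right)}{\Gamma\left(\frac{\alpha}{w_j} + n_{jl}\right)} \prod_{k=1}^{v_j} \frac{\Gamma\left(\frac{\alpha}{w_j v_j} + n_{jkl}\right)}{\Gamma\left(\frac{\alpha}{w_j v_j}\right)}$$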
This is the Bayesian Dirichlet score, which satisfies
likelihood equivalence [Heckerman et al., 1995]. Namely,
reparametrizations of the model that do not affect the
conditional independence relationships between the features,
for example by pivoting a tree to a different root, do not
change the likelihood.
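To make the score and its likelihood equivalence concrete, here is a minimal sketch that computes the log Bayesian Dirichlet score of one feature from counts and checks that reversing the edge in a two-node tree leaves the total score unchanged. It assumes the standard form of the score with a Dirichlet pseudo-count of $\alpha / (w_j v_j)$ per cell; the function name, the default `alpha=1.0`, and the toy data are assumptions for illustration.

```python
from collections import Counter
from math import lgamma


def log_bd_score(child, parents, v, w, alpha=1.0):
    """Log Bayesian Dirichlet score of a feature given its parent set.

    child:   list of observed values of the feature
    parents: list of observed parent configurations (hashable), same length
    v, w:    number of possible child values and parent configurations
    alpha:   Dirichlet hyperparameter (the default is an assumed value)
    """
    n_l = Counter(parents)               # counts per parent configuration
    n_kl = Counter(zip(parents, child))  # counts per (configuration, value)
    a_l = alpha / w
    a_kl = alpha / (w * v)
    score = 0.0
    # Configurations and cells with zero count contribute exactly zero.
    for n in n_l.values():
        score += lgamma(a_l) - lgamma(a_l + n)
    for n in n_kl.values():
        score += lgamma(a_kl + n) - lgamma(a_kl)
    return score


# Toy binary data; () stands for the single "no parents" configuration.
x1 = [0, 0, 1, 1, 1, 0]
x2 = [0, 1, 1, 1, 0, 0]
no_parent = [()] * len(x1)

# Tree X1 -> X2 versus the pivoted tree X2 -> X1.
score_root_x1 = log_bd_score(x1, no_parent, 2, 1) + log_bd_score(x2, x1, 2, 2)
score_root_x2 = log_bd_score(x2, no_parent, 2, 1) + log_bd_score(x1, x2, 2, 2)
```

The two totals agree up to floating-point error, which is the property that lets the Pivot Trees update be accepted without evaluating the likelihood.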


(b) Reassign Subtree: reassign node Xq to be a child of node X8

(c) Pivot Trees: nodes Xq and X10 become tree roots

Figure 2: Example MCMC updates applied to the graph in
Figure 1


3.2. MCMC Updates 

Switch Trees: Randomly choose trees $T_1, \ldots, T_k$ without
replacement (we use k = 10), and propose switching each tree
to the opposite group one by one (see Figure 2a). This is a
repeated Metropolis update.

Reassign Subtree: Randomly choose a node $X_j$, detach the
subtree rooted at this node and choose a different parent node
for this subtree (see Figure 2b). This is a Gibbs update, so
it is always accepted.

We consider the set of nodes $X_{j'}$ that are not descendants
of $X_j$ as candidate parent nodes (to avoid creating a
cycle), with corresponding graphs $G_{j'}$. We also consider a
“null parent” option for each group, where $X_j$ becomes a
root in that group, with corresponding graph $G_i$ for group
i. Choose a graph $G^*$ from this set according to the
conditional posterior distribution $\pi(G^*)$ (conditioning on
the parents of all the nodes except $X_j$, and on the group
membership of all the nodes outside the subtree). The subtree
joins the group of its new parent.

As a special case, this results in a tree merge if $X_j$ was a
root node, or a tree split if $X_j$ becomes a root (i.e. the
new parent is null). Note that the new parent can be the
original parent, in which case the graph does not change.

Pivot Trees: Pivot all the trees by randomly choosing a new
root for each tree (see Figure 2c). By likelihood equivalence,
this update is always accepted.

For computational efficiency, in practice we do not pivot all
the trees at each iteration. Instead, we just pivot the tree
containing the chosen node $X_j$ within each Reassign Subtree
move, since this is the only time the parametrization of a
tree matters. This implementation produces an equivalent
sampling mechanism.
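The accept/choose steps behind the Switch Trees and Reassign Subtree updates can be sketched generically as below. This is a minimal illustration under stated assumptions, not the package's implementation: the log posterior values are taken as given inputs, the switch proposal is treated as symmetric, and the helper names are hypothetical.

```python
import math
import random


def metropolis_accept(log_post_old, log_post_new, rng):
    """Metropolis step for a symmetric proposal (e.g. switching one tree).

    Accepts with probability min(1, exp(log_post_new - log_post_old)).
    """
    log_ratio = log_post_new - log_post_old
    return log_ratio >= 0 or rng.random() < math.exp(log_ratio)


def gibbs_choose(log_posts, rng):
    """Gibbs step: sample an index with probability proportional to
    exp(log_posts[i]), e.g. one entry per candidate parent (including the
    "null parent" options) in a Reassign Subtree move."""
    m = max(log_posts)  # subtract the maximum for numerical stability
    weights = [math.exp(lp - m) for lp in log_posts]
    return rng.choices(range(len(log_posts)), weights=weights, k=1)[0]


rng = random.Random(0)
accepted = metropolis_accept(-5.0, -3.0, rng)       # uphill move
chosen = gibbs_choose([-1000.0, 0.0, -1000.0], rng)  # one dominant candidate
```

Working with log posteriors and subtracting the maximum before exponentiating avoids underflow when candidate graphs differ greatly in score.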


Table 2: Data set properties [Friedman et al., 1997]

Data set         #Features  #Classes  #Instances (Train)  #Instances (Test)
australian              14         2                 690  CV-5
breast                  10         2                 683  CV-5
chess                   36         2                2130  1066
cleve                   13         2                 296  CV-5
corral                   6         2                 128  CV-5
crx                     15         2                 653  CV-5
diabetes                 8         2                 768  CV-5
flare                   10         2                1066  CV-5
german                  20         2                1000  CV-5
glass                    9         6                 214  CV-5
glass2                   9         2                 163  CV-5
heart                   13         2                 270  CV-5
hepatitis               19         2                  80  CV-5
iris                     4         3                 150  CV-5
letter                  16        26               15000  5000
lymphography            18         4                 148  CV-5
mofn-3-7-10             10         2                 300  1024
pima                     8         2                 768  CV-5
satimage                36         6                4435  2000
segment                 19         7                1540  770
shuttle-small            9         6                3866  1934
soybean-large           35        19                 562  CV-5
vehicle                 18         4                 846  CV-5
vote                    16         2                 435  CV-5
waveform-21             21         3                 300  4700
ad                    1558         2                2276  988
arcene               10000         2                 100  100
arcene-cv            10000         2                 200  CV-5
gisette               5000         2                6000  1000
isolet                 617        26                6238  1559
madelon                500         2                2000  600
microsoft              294         2               32711  5000

3.3. Classification Using Bayesian Model Averaging 

Graphs are sampled from the posterior distribution using the
MCMC algorithm. We apply Bayesian model averaging [Hoeting et
al., 1998] rather than using the posterior mode for
classification. For each possible class, we average the
probabilities over a thinned subset of the sampled graph
structures, and then choose the class label with the highest
average probability. Given a test data point
$\mathbf{x}_{test}$, we find

$$P(Y = y \mid X = \mathbf{x}_{test}, \mathbf{X}, \mathbf{y}) \propto \sum_{i=1}^{S} P(Y = y \mid X = \mathbf{x}_{test}, G_i) \, P(G_i \mid \mathbf{X}, \mathbf{y})$$

where S is the number of graphs sampled by MCMC (after
thinning by a factor of 50). We use training data counts to
compute the posterior probability of the class label given
each sampled graph $G_i$.

Table 3: SBFC runtime on high-dimensional data sets in minutes

Data set     Runtime (min)
ad                       5
arcene                  60
arcene-cv               65
gisette                134
isolet                  23
madelon                  1
microsoft                2
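The model-averaging rule of Section 3.3 can be sketched as follows, assuming the per-graph class probabilities $P(Y = y \mid \mathbf{x}_{test}, G_i)$ have already been computed for each thinned sample. The function name and the optional `weights` argument are hypothetical; with the default uniform weights the rule reduces to the plain average over sampled graphs described in the text.

```python
def bma_predict(per_graph_probs, weights=None):
    """Average class probabilities over sampled graphs and pick the best class.

    per_graph_probs: list (one entry per sampled graph) of dicts mapping
                     class label -> P(Y = label | x_test, G_i)
    weights:         optional per-graph weights; uniform if omitted
    """
    s = len(per_graph_probs)
    if weights is None:
        weights = [1.0 / s] * s
    total = sum(weights)
    classes = per_graph_probs[0].keys()
    avg = {
        c: sum(w * p[c] for w, p in zip(weights, per_graph_probs)) / total
        for c in classes
    }
    label = max(avg, key=avg.get)  # class with the highest averaged probability
    return label, avg


# Three hypothetical sampled graphs that disagree on the most likely class.
probs = [
    {"a": 0.9, "b": 0.1},
    {"a": 0.4, "b": 0.6},
    {"a": 0.3, "b": 0.7},
]
label, avg = bma_predict(probs)
```

In this toy case two of the three graphs favor class "b", yet the averaged probability favors "a", which illustrates how averaging differs from voting or from using a single posterior-mode graph.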

4. Experiments 

We compare our classification performance with the fol¬ 
lowing methods. 

BART: Bayesian Additive Regression Trees, R package
BayesTree [Chipman et al., 2010],

C5.0: R package C50 [Quinlan, 1993],

CART: Classification and Regression Trees, R package
tree [Breiman et al., 1984],

Lasso: R package glmnet [Friedman et al., 2010],

LR: logistic regression,

NB: Naive Bayes, R package e1071 [Duda & Hart, 1973],

RF: Random Forest, R package ranger [Breiman, 2001],

SVM: Support Vector Machines, R package e1071 [Evgeniou
et al., 2000],

TAN: Tree-Augmented Naive Bayes, R package
bnlearn [Friedman et al., 1997].

We use 25 small benchmark data sets used by Friedman et al.
[1997] and 6 high-dimensional data sets [Guyon et al., 2005],
all from the UCI repository [Lichman, 2013], described in
Table 2. We split the large data sets into a training set and
a test set, and use 5-fold cross-validation for the smaller
data sets (we try both approaches for the high-dimensional
arcene data set). We remove the instances with missing values,
and discretize continuous features, using Minimum Description
Length Partitioning [Fayyad & Irani, 1993] for the small data
sets and binary binning [Dougherty et al., 1995] for the large
ones. For a data set with d features, we run SBFC for
max(10000, 10d) iterations, which has empirically been
sufficient for stabilization. Figure 3 compares SBFC’s
classification performance to the other methods.

Figure 3: Classification accuracy on low- and high-dimensional
data sets, showing average accuracy over 5 runs for each
method, with the top half of the methods in bold for each data
set. Note that some of the classifiers could not handle
multiclass data sets, and TAN timed out on the
highest-dimensional data sets. SBFC performs competitively
with SVM, TAN and some decision tree methods (BART and RF),
and generally outperforms the others.


We evaluate SBFC’s feature selection and interaction detection
performance on the data sets heart, corral, and madelon, in
Figures 4, 5, and 6 respectively. We compare SBFC’s feature
selection performance to Lasso, as well as RF’s importance
metric and BART’s varcount metric, which rank features by
their influence on classification, in Figures 4c, 5e, 5f, and
6c. We illustrate the structures learned by SBFC on these data
sets using sampled graphs, shown in Figures 4a, 5a, 5b, and
6a, and average graphs over all the MCMC samples, shown in
Figures 4b, 5c, 5d, and 6b.
(b) Average graph for heart data set

(c) Feature selection comparison for heart data set

Figure 4: The sampled graph in Figure 4a and the average graph
in Figure 4b show feature and interaction selection for the
heart data set with features of medical significance. The
dark-shaded features in the average graph are the most
relevant for predicting heart disease. There are several
groups of relevant interacting features: (Sex, Thalassemia),
(Chest Pain, Angina), and (Max Heart Rate, ST Slope, ST
Depression). The features in each group jointly affect the
presence of heart disease. Figure 4c compares feature rankings
with other methods, showing that all the methods agree on the
top 9 features, but SBFC disagrees with the other methods on
the top 3 features.


In the average graphs, the nodes are color-coded according to
relevance, based on the proportion of sampled graphs where the
corresponding feature appeared in Group 1 (dark-shaded nodes
appear more often). Edge thickness also corresponds to
relevance, based on the proportion of samples where the
corresponding feature interaction appeared. To avoid clutter,
only edges that appear in at least 10% of the sampled graphs
are shown, and nodes that appear in Group 0 more than 80% of
the time are omitted for high-dimensional data sets. Average
graphs are undirected and do not necessarily have a tree
structure. They provide an interpretable visual summary of the
relevant features and feature interactions.
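The aggregation behind an average graph can be sketched as below: node relevance is the fraction of samples in which a feature is in Group 1, and edge relevance is the fraction of samples containing that (undirected) edge, with the 10% display threshold applied. The representation of each sampled graph as a pair of sets and the function name are assumptions for illustration.

```python
from collections import Counter


def average_graph(samples, edge_min=0.10):
    """Summarize sampled graphs into node and edge relevance proportions.

    samples:  list of (signal_nodes, edges), where signal_nodes is a set of
              feature names in Group 1 and edges is a set of frozensets, each
              holding the two endpoints of an undirected edge
    edge_min: minimum proportion of samples for an edge to be displayed
    """
    s = len(samples)
    node_counts = Counter()
    edge_counts = Counter()
    for signal_nodes, edges in samples:
        node_counts.update(signal_nodes)
        edge_counts.update(edges)
    relevance = {node: c / s for node, c in node_counts.items()}
    shown_edges = {e: c / s for e, c in edge_counts.items() if c / s >= edge_min}
    return relevance, shown_edges


# 20 hypothetical samples: an edge seen once (5%) falls below the threshold.
common = ({"X1", "X2"}, {frozenset({"X1", "X2"})})
rare = ({"X1"}, {frozenset({"X1", "X3"})})
relevance, shown_edges = average_graph([common] * 19 + [rare])
```

Storing edges as frozensets makes them direction-free, which is consistent with average graphs being undirected even though each sampled tree has a root.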

As shown in Table 3, the runtime of SBFC scales approximately
as $d \cdot n \cdot 2 \cdot 10^{-4}$ seconds (on an AMD
Opteron 6300-series processor), so it takes somewhat longer to
run than many of the other methods on high-dimensional data
sets. SBFC’s memory usage scales quadratically with d.

5. Conclusion 

Selective Bayesian Forest Classifier is an integrated tool for
supervised classification, feature selection, interaction
detection and visualization. It splits the features into
signal and noise groups according to their relationship with
the class label, and uses tree structures to model
interactions among both signal and noise features. The forest
dependence structure gives SBFC modeling flexibility and
competitive classification performance, and it maintains good
feature and interaction selection performance as the signal to
noise ratio decreases. Useful directions for future work
include extending SBFC to a semi-supervised learning method,
and improving runtime and memory performance.






Interpretable Selection and Visualization of Features and Interactions Using Bayesian Forests 



(a) A sampled graph for the original corral data set with 6 features

(b) A sampled graph for the augmented corral data set with 100 features

(c) Average graph for the original corral data set with 6 features

(d) Average graph for the augmented corral data set with 100 features

(e) Feature selection comparison for the original corral data set with 6 features

(f) Feature selection comparison for the augmented corral data set with 100 features


Figure 5: In the synthetic data set corral, the true feature
structure is known: the relevant features are
$\{X_1, X_2, X_3, X_4, X_6\}$, and the most relevant edges are
$\{X_1, X_2\}$, $\{X_3, X_4\}$, while the other edges between
the first 4 features are less relevant, and any edges with
$X_5$ or $X_6$ are not relevant. The sampled graph in Figure
5a and the average graph in Figure 5c show that SBFC recovers
the true correlation structure between the features, with the
most relevant edges appearing the most frequently (as
indicated by thickness). We generate extra noise features for
this data set by choosing an existing feature at random and
shuffling the rows, making it uncorrelated with the other
features. The sampled graph in Figure 5b and the average graph
in Figure 5d show that SBFC recovers the relevant features and
some relevant interactions when the amount of noise increases.
Figures 5e and 5f show that all the methods consistently rank
the 5 relevant features (colored blue) above the rest (colored
red).
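The noise-feature construction described in the caption above (copy a randomly chosen feature and shuffle its rows) can be sketched as follows. This is an illustration under assumptions: the data set is a list of rows, noise columns are always drawn from the original features rather than from previously generated ones, and the function name is hypothetical.

```python
import random


def add_noise_features(rows, n_extra, rng):
    """Append n_extra noise features to a data set given as a list of rows.

    Each noise feature is a copy of a randomly chosen original column whose
    values are shuffled across rows, so it keeps the same marginal
    distribution but loses its association with the other columns.
    """
    n_orig = len(rows[0])
    out = [list(r) for r in rows]      # copy so the input is not modified
    for _ in range(n_extra):
        j = rng.randrange(n_orig)      # pick an original feature at random
        column = [r[j] for r in rows]
        rng.shuffle(column)            # permute its values across rows
        for r, value in zip(out, column):
            r.append(value)
    return out


# Toy data with 2 features and 4 rows, augmented with 3 noise features.
rows = [[0, 1], [1, 1], [0, 0], [1, 0]]
augmented = add_noise_features(rows, 3, random.Random(1))
```

Shuffling preserves each noise column's value frequencies, so a method cannot separate noise from signal using marginal distributions alone.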
(a) A sampled graph for madelon data set

(c) Feature selection comparison for madelon data set

Figure 6: Feature and edge selection for the synthetic madelon
data set, used in the 2003 NIPS feature selection challenge.
This data set, with 20 relevant features and 480 noise
features, was artificially constructed to illustrate the
difficulty of selecting a feature set when no feature is
informative by itself, and all the features are correlated
with each other [Guyon et al., 2005]. SBFC reliably selects
the correct set of 20 relevant features [Guyon et al., 2006],
as shown in Figure 6c, and appropriately puts them in a single
connected component, shown in dark blue in the average graph
in Figure 6b. As shown in Figure 6c, none of the other methods
correctly identify the set of 20 relevant features (colored
blue), though Random Forest comes close with 19 out of 20
correct. Our classification performance on this data set is
not as good as that of BART or RF, likely because SBFC
constrains these highly correlated features to form a
tree-structured Bayesian network, while a decision tree
structure allows a feature to appear more than once.